Genome analysis of biosurfactant producing bacterium, Bacillus tequilensis

Bioremediation is crucial for restoring polluted water and soil. Biosurfactants play a vital role in bioremediation by expanding the surface area of substrates, and the compounds released by biosurfactant-producing microbes are promoted for oil spill remediation. In the present investigation, a biosurfactant-producing bacterium, Bacillus tequilensis, was isolated from Chilika Lake, Odisha, India (latitude and longitude: 19.8450 N, 85.4788 E). Whole-genome sequencing (WGS) of Bacillus tequilensis was carried out on the Illumina NextSeq 500 platform. The genome was 4.47 Mb in size (4,478,749 base pairs), forming a circular chromosome with 528 scaffolds, 4492 protein-encoding genes (ORFs), 81 tRNA genes, and 114 ribosomal RNA transcription units. There were 4,209,415 raw reads and 4,058,238 processed reads. The genome obtained in the present investigation was used for genome annotation, variant calling, variant annotation, and comparative genome analysis with other existing Bacillus species. A pathway describing the biosurfactant metabolism of Bacillus tequilensis was constructed. The study identified that the genes SrfAD, SrfAC, SrfAA, and SrfAB are involved in biosurfactant synthesis. The sequences of SrfAD, SrfAC, SrfAA, and SrfAB were deposited in the GenBank database under accessions MUG02427.1, MUG02428.1, MUG02429.1, and MUG03515.1, respectively. The whole genome sequence was submitted to GenBank under accession RMVO00000000, and the raw fastq reads were submitted to the NCBI SRA repository under accession SRX5023292.


Introduction
Heavy metal contamination has become a serious ecological threat. Metals, especially cadmium and zinc, are of particular concern because their degradation to innocuous products is difficult and takes millions of years [1][2][3]. Bioremediation systems have long been proposed to neutralize metal contamination; however, low bioavailability leads to an incomplete bioremediation process. Further, bioremediation processes such as phytoremediation with synthetic chelators have proven to be expensive and environmentally hazardous [4,5]. Various surface-active compounds (SACs), commonly biosurfactants, produced by microorganisms have emerged as safe alternatives to chemical remediation [6][7][8].
The whole-genome sequence represents a valuable shortcut, helping scientists to find genes much more easily and quickly. Studying the entire genome sequence is expected to help in understanding how genes work together to direct the maintenance, development, and growth of a whole organism. It can also be used to predict the genes involved in the synthesis of biosurfactants in microbes [9,10]. Therefore, the present study aimed to sequence the whole genome of biosurfactant-producing Bacillus tequilensis using next-generation sequencing, followed by de novo assembly, genome annotation, variant calling, and variant annotation.

Identification of biosurfactant producing Bacillus tequilensis
Most biosurfactants are produced by microbes of the genus Pseudomonas, followed by Bacillus and Acinetobacter [11]. In a previous investigation, a novel strain of Bacillus tequilensis was identified by various biochemical and microbial tests, such as the haemolysis test, oil spreading test, CTAB agar plate test, and drop collapse test, and by molecular characterization, i.e., 16S rRNA gene sequencing and phylogenetic assessment. The sequence was found to be novel; the strain was named Bacillus tequilensis strain ANSKLAB04 and deposited in GenBank under accession number KU529483 [12]. The same strain was used for whole genome sequencing in the current investigation. In the present investigation, we also confirmed the novelty of the strain by Average Nucleotide Identity (ANI) analysis and whole genome-to-genome comparison. According to ANI analysis and digital DNA-DNA hybridization, the genome sequence of Bacillus tequilensis ANSKLAB04 was most similar to that of Bacillus subtilis, with an OrthoANI of 98.56%, an original ANI of 98.47%, and a Genome-to-Genome Distance Calculator (GGDC) distance of 0.0146. A lower GGDC distance indicates a closer relationship between species. Bacillus halotolerans had the second-highest similarity to Bacillus tequilensis ANSKLAB04, with an OrthoANI of 96.02%, an original ANI of 95.98%, and a GGDC distance of 0.04. The next three most similar species are Bacillus tequilensis KCTC 13622, Bacillus vallismortis, and Bacillus mojavensis RO-H-1 = KCTC 3706, which are closely related to Bacillus tequilensis ANSKLAB04 [Fig 1] [Table 1].

Bioanalyzer profile
DNA was isolated using the Phenol/Chloroform (PCl) genomic DNA extraction method [12]. The bioanalyzer profile of the prepared WGS library showed fragments in a size range of 300-600 bp. The effective insert size of the library was 180-480 bp, flanked by adaptors with a combined size of ~120 bp. Based on the fragment distribution and concentration, the library was suitable for sequencing on the Illumina platform.
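The relationship between the library fragment range and the effective insert size is a simple subtraction of the combined adaptor length. The sketch below reproduces that arithmetic; the function name and the treatment of the ~120 bp adaptor length as an exact constant are assumptions made for illustration.

```python
# Combined adaptor length reported in the text (approximate, "~120 bp").
ADAPTOR_TOTAL_BP = 120


def insert_size_range(fragment_min_bp, fragment_max_bp, adaptor_bp=ADAPTOR_TOTAL_BP):
    """Return the (min, max) insert size after subtracting the adaptor length."""
    if fragment_min_bp > fragment_max_bp:
        raise ValueError("fragment_min_bp must not exceed fragment_max_bp")
    return fragment_min_bp - adaptor_bp, fragment_max_bp - adaptor_bp


# Library fragments of 300-600 bp give inserts of 180-480 bp.
print(insert_size_range(300, 600))
```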

Genome representation
The complete genome of Bacillus tequilensis consists of a single circular chromosome of 4,478,749 bp with an average G+C content of 46.33% (Table 2 and S1 Table). The 4492 predicted coding ORFs cover 87% of the genome, with an average ORF length of 283 aa (S1 Table). Among these, 1,347 (67.4%) were assigned putative functions, 258 (12.9%) matched conserved hypothetical coding sequences of unknown function, and the remaining 394 (19.7%) showed no similarity to any known genes [Table 3].
All genes were classified according to the COG classification (http://www.ncbi.nlm.nih.gov/COG/). The variations in nucleotide frequencies across the whole genome sequence were investigated using a non-overlapping sliding window and three indices of nucleotide frequency: G+C%, (G+C)/(A+T+C+G); divergence from [A] = [T], (A-T)/(A+T); and divergence from [C] = [G], (C-G)/(C+G). These three indices are, by construction, pairwise independent and summarize relative nucleotide frequencies without loss of information. Because of their very low frequency, ambiguous nucleotide bases were not taken into account. The standard deviation (SD) of the three indices was estimated using a normal distribution approximation, as the total number of bases was large. The strand analyzed here was the 5' to 3' strand clockwise on the genetic map. A window size of 1 kb was used.
In the genome map [Fig 2], from the inside, green and red bars represent RNA sequences on the positive and negative strands, respectively. Circle 1 represents G+C content (window size: 10 kb) higher and lower than 45%, where red represents higher and green represents lower. Circle 2 represents GC skew, where green and red represent positive and negative values, respectively.
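The three nucleotide-frequency indices can be computed per window with a few lines of code. The following is a minimal sketch, assuming the sequence is available as a plain string; the function name is hypothetical, and trailing bases that do not fill a complete window are dropped for simplicity.

```python
def window_indices(seq, window=1000):
    """Yield (start, gc_fraction, at_divergence, cg_divergence) per window.

    Windows are non-overlapping. Ambiguous bases (anything other than
    A, C, G, T) are ignored, as described in the text.
    """
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        a, c, g, t = (chunk.count(base) for base in "ACGT")
        total = a + c + g + t
        if total == 0:
            continue  # window contains only ambiguous bases
        gc_fraction = (g + c) / total
        at_divergence = (a - t) / (a + t) if (a + t) else 0.0
        cg_divergence = (c - g) / (c + g) if (c + g) else 0.0
        yield start, gc_fraction, at_divergence, cg_divergence


# Toy example with a window of 4 bases (the study used 1 kb).
for row in window_indices("AAATCCCG", window=4):
    print(row)
```

For a real genome the sequence would first be read from the assembled FASTA file, and the per-window values could then be plotted along the chromosome.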

Gene ontology and biological annotation
The gene ontology analysis showed that 18.99% of genes in Bacillus tequilensis were assigned to transferase activity, 13.55% to kinase activity, 9.3% to ATP binding, 9.3% to hydrolase activity, 6.91% to methyltransferase activity, 5.98% to lipase activity, 5.98% to oxidoreductase activity, 4.9% to lyase activity, and 3.05% to peptidase activity, whereas only 2.79% of genes were involved in cell division, 2.9% in carbohydrate transport, 7.7% in ribose production, and 3.8% in viral capsid [Fig 3].

Subsystem classification
Genes obtained from the whole genome of Bacillus tequilensis were used for subsystem classification. Subsystems were categorized based on cofactors, cell wall, virulence, potassium metabolism, membrane transport, iron acquisition and metabolism, RNA metabolism, cell division and cell cycle, motility and chemotaxis, fatty acids, lipids and isoprenoids, nitrogen metabolism, etc., and are discussed in [S2

Metabolism of biosurfactant producing genes
Bacillus tequilensis produces a biosurfactant that belongs to the class of lipopeptides, has excellent emulsifying properties, and is capable of reducing the surface tension of water to a significantly lower value. The genes associated with biosurfactant production are listed in [Table 4]. Among the biosurfactant-producing bacterial genera, members of Bacillus and Pseudomonas are used most often because of their wide range of applications and versatility. Bacillus species are phenotypically and genotypically heterogeneous. Based on several investigations, a unique inhabitant of Bacillus sp. found at the  [15][16][17]. From Bacillus tequilensis we identified SrfA, which is involved in biosurfactant production, and the sequence of SrfA (242 aa) was deposited in GenBank under accession MUG02427.1. Lichenysin is another lipopeptide biosurfactant, produced by B. licheniformis and encoded by the lichenysin operon (LchA), which comprises four peptide synthetase genes: LicAA, LicAB, LicAC, and LicAD. In another study, the authors isolated the gene sfp (phosphopantetheinyl transferase, 224 amino acids), mapped it 4 kb downstream of the srfA operon, and concluded that it is essential for the post-translational modification of surfactin synthetase in microbes [15,16]. In this study, we identified the sfp gene in Bacillus tequilensis, and the sequence of sfp (phosphopantetheinyl transferase, 224 amino acids) was deposited in GenBank under accession MUG02422.1.
Moreover, two operons involved in biosurfactant synthesis, srfA and pps, were found to be present only in the UMX-103 and B. subtilis 168 strains. The srfA operon contains four genes, srfAA, srfAB, srfAC, and srfAD, and the pps operon contains four genes, ppsB, ppsC, ppsD, and ppsE. The genes rmlA, rmlB, rmlC, and rmlD are present only in UMX-103 strains, whereas sigA, DnaK, and LytR are present specifically in the Bacillus strain. The genes comA, comP, rpoN, abrB, and ResD are present in both UMX-103 and B. subtilis 168 [18]. Based on the above literature and biological annotation, we identified the DnaK and LytR genes in Bacillus tequilensis, and the sequences were deposited in NCBI under accessions MUF99480.1 and MUG01692.1, respectively.
Pseudomonas species require the plasmid-encoded rhlA, rhlB, rhlR, and rhlI genes of the rhl quorum-sensing system for the production of glycolipid biosurfactants; these genes are also involved in the production of rhamnolipids in a heterologous host. Iturin A is an antifungal lipopeptide biosurfactant produced by certain Bacillus subtilis strains such as Bacillus subtilis RB14. Its operon is composed of four ORFs, namely ituD, ituA, ituB, and ituC, whose disruption leads to a specific deficiency in iturin A production. The three genes of the arthrofactin operon of Pseudomonas, arfA, arfB, and arfC, encode ArfA, ArfB, and ArfC, which contain two, four, and five functional modules, respectively, required for condensation, adenylation, and thiolation. Amphisin, produced by Pseudomonas sp. DSS73, requires the gacS and amsY genes for its production, as shown by mutants defective in these genes. Amphisin synthesis is regulated by the gacS gene, as the gacS mutant regains the property of surface motility upon the introduction of a plasmid. Moreover, the genes dnaK, dnaJ, and grpE positively regulate the biosynthesis of putisolvin [14]. The putisolvin biosynthesis genes dnaK, dnaJ, and grpE were identified in Bacillus tequilensis, and the sequences were deposited in GenBank under accessions MUF99480.1, MUF99481.1, and MUF99479.1, respectively.
Acinetobacter species produce the high molecular weight biosurfactants emulsan and alasan. AlnA, AlnB, and AlnC are essential for alasan biosynthesis, whereas wza, wzb, wzc, wzx, and wzy are required for emulsan biosynthesis. For the production of fungal biosurfactants, emt1 and cyp1 are the two genes involved in the synthesis of these glycolipids, and the hfb1 and hfb2 genes regulate the synthesis of hydrophobin [14]. Thus, genes play a major role in the biosynthesis of various microbial surfactants, and hence the role of molecular genetics and gene regulation mechanisms in the production of biosurfactant is

Biosurfactant / Lipopeptide metabolism of Bacillus tequilensis
Considering the biosurfactant-producing genes described in the literature, we classified the genes of Bacillus tequilensis based on established efficient biosurfactant activity and broad applications. Biosurfactants have proven to be promising, possessing the unique properties of low toxicity and high biodegradability. In the present investigation, we constructed a pathway that describes the biosurfactant metabolism of Bacillus tequilensis [Fig 5]. The lipopeptide synthesized consists of a long fatty acid chain along with glutamic acid (Glu), leucine (Leu), aspartic acid (Asp), and valine (Val). The synthesis is non-ribosomal and is carried out by a large multienzyme complex, the non-ribosomal peptide synthetases (NRPS). The peptide synthetase required for the amino acid moiety of surfactin is encoded by four open reading frames in the srfA operon.

Table 4. Biosurfactant-producing genes of Bacillus species. (Columns: S.No; Gene involved in biosurfactant production; Reference.)
Glu, Asp, Val, and Leu are considered the intrinsic components of surfactin. Glu and Asp are synthesized by aspartate aminotransferases such as AspB and YhdR, which were identified in Bacillus tequilensis; their sequences were deposited in GenBank under accessions MUF99794.1 and MUF99877.1, respectively. An efficient fatty acid biosynthesis pathway determines efficient surfactin production. The building precursor acetyl-CoA initiates the biosynthesis of fatty acids. The biosynthesis of surfactin is catalyzed by NRPS and is initiated by the condensation of fatty acids and Glu. The other constituent amino acids are assembled by the NRPS multi-enzyme complex, which comprises adenylation, condensation, and thiolation domains responsible for the activation of amino acids and peptide chain elongation.

PLOS ONE

Genome evolution of B. tequilensis
The enormous genomic data obtained from sequencing of Bacillus tequilensis ANSKLAB04 was aligned against the existing top 20 homologous species of Bacillus in the NCBI database.

SNP and indel discovery
Our indel discovery strategy involved mining insertion and deletion polymorphisms from DNA sequencing traces that were originally generated by genome centers for SNP discovery. The mass-sequenced data of Bacillus tequilensis ANSKLAB04 were used to search for genetic variation against existing homologous biosurfactant-producing bacteria from GenBank. Five existing homologous genomes were used: Bacillus tequilensis KCTC 13622, Bacillus subtilis, Bacillus mojavensis, Bacillus vallismortis, and Bacillus halotolerans. The number of mapped sites per sample, mapping coverage, total number of reads, number of mapped reads, overall mapping ratio, number of mapped bases, and average alignment depth were calculated. Table 6 presents these statistics for Bacillus tequilensis in comparison with the 5 reference genomes. The total number of reads was 6,229,938, which was constant across all reference genomes. The mean depth indicates the number of reads, on average, that were likely to be aligned at a given reference base position in comparison with Bacillus tequilensis. However, Bacillus subtilis was having 90.39% of mapped

After removing duplicates with Sambamba and identifying variants with SAMTools, the information for each variant was gathered and classified by chromosome or scaffold. Table 7 summarizes the variant calling of Bacillus tequilensis ANSKLAB04 against the top 5 homologous reference bacterial genomes, which include Bacillus tequilensis (KCTC    Fig 9A], Bacillus subtilis [Fig 9B], Bacillus vallismortis [Fig 9C], Bacillus halotolerans [Fig 9D], Bacillus mojavensis [Fig 9E].
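The mapping statistics listed above are ratios of simple counts. The sketch below shows one plausible way to derive them; the exact definitions used for Table 6 are not stated in the text, so the formulas (in particular, mean depth as mapped bases divided by mapped sites) and the function name are assumptions, and the numbers in the usage example are toy values rather than data from the study.

```python
def mapping_summary(total_reads, mapped_reads, mapped_bases,
                    mapped_sites, reference_length):
    """Summarize a read alignment from raw counts (assumed definitions).

    mapping_ratio_pct: share of reads that aligned to the reference.
    coverage_pct:      share of reference positions covered by >= 1 read.
    mean_depth:        average number of aligned bases per covered position.
    """
    if min(total_reads, mapped_sites, reference_length) <= 0:
        raise ValueError("counts used as denominators must be positive")
    return {
        "mapping_ratio_pct": 100.0 * mapped_reads / total_reads,
        "coverage_pct": 100.0 * mapped_sites / reference_length,
        "mean_depth": mapped_bases / mapped_sites,
    }


# Toy values for illustration only.
print(mapping_summary(total_reads=1000, mapped_reads=900,
                      mapped_bases=90000, mapped_sites=3000,
                      reference_length=4000))
```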

Transition and transversion information.
The number of transition (Ts) and transversion (Tv), and the Ts/Tv ratio were calculated using the base change count. Base

percentage of Ts/Tv was 1.55%, which was estimated by pairwise sequence comparison. In contrast, Bacillus subtilis had the lowest counts of total SNPs, transitions, and transversions but the highest Ts/Tv ratio, i.e., 2.15. A transition is an interchange between two purines (A and G) or between two pyrimidines (C and T), whereas a transversion is an interchange between a purine and a pyrimidine (A and C, A and T, G and C, or G and T), as shown in [Fig 10]. Bacillus subtilis had the highest share of transitions in comparison with Bacillus tequilensis, i.e., 68.2% (Fig 10B). The share of transversions was highest in Bacillus halotolerans and Bacillus mojavensis, i.e., 39.1%. However, for all 5 reference genomes compared with Bacillus tequilensis, the percentage of transitions was higher than that of transversions (Fig 10D and 10E). Transitions are less likely to result in amino acid substitutions and are therefore more likely to persist as "silent substitutions" in populations as single nucleotide polymorphisms (SNPs).
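The Ts/Tv calculation can be illustrated with a short sketch. A transition exchanges two purines (A, G) or two pyrimidines (C, T); every other single-base change is a transversion. The function names below are hypothetical, and the input is assumed to be a list of (reference base, alternate base) pairs extracted from the variant calls.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_snp(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a valid SNP: {ref!r} -> {alt!r}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv(snps):
    """Return (transitions, transversions, Ts/Tv ratio) for (ref, alt) pairs."""
    snps = list(snps)
    ts = sum(1 for ref, alt in snps if classify_snp(ref, alt) == "transition")
    tv = len(snps) - ts
    ratio = ts / tv if tv else float("inf")
    return ts, tv, ratio


# Two transitions (A>G, C>T) and two transversions (A>C, G>T).
print(ts_tv([("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]))
```

The same ratio can be recovered from percentages: a transition share of 68.2% implies 68.2 / 31.8 ≈ 2.14, which is in line with the Ts/Tv value of about 2.15 reported for Bacillus subtilis.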

Variant annotation
SnpEff was used to obtain annotation information, such as amino acid changes caused by variants.  There was only 1 frameshift variant, which indicates a disruption of the translational reading frame because the number of nucleotides inserted or deleted was not a multiple of three; this count was almost negligible.

Table 11 presents the annotation type counts of Bacillus tequilensis ANSKLAB04 when aligned with Bacillus subtilis. Seven types of annotation were found: upstream gene variant, downstream gene variant, intergenic region, synonymous variant, missense variant, initiator codon variant, and disruptive inframe insertion. The upstream gene variant, which indicates a sequence variant located 5' of a gene, had the maximum share of 96.92% (47,287), whereas the downstream gene variant, which indicates a sequence variant located 3' of a gene, accounted for 1,098 (2.25%). The synonymous variant count was 31 (0.06%), indicating sequence variants that cause no change to the encoded amino acid.

Table 12 presents the annotation type counts of Bacillus tequilensis ANSKLAB04 when aligned with Bacillus vallismortis. Nine types of annotation were found. The upstream gene variant had the maximum share of 98.67% (270,075), whereas the downstream gene variant accounted for 3,200 (1.17%).

Table 13 presents the annotation type counts of Bacillus tequilensis ANSKLAB04 when aligned with Bacillus halotolerans. There were various types of annotation found in Bacillus

Impact, annotation type, and description of inframe variants:
One or many codons are inserted (e.g., an insert of a multiple of three at a codon boundary).
MODERATE, disruptive_inframe_insertion: One codon is changed and one or many codons are inserted (e.g., an insert of size multiple of three, not at a codon boundary).
MODERATE, inframe_deletion: One or many codons are deleted (e.g., a deletion of a multiple of three at a codon boundary).
MODERATE, disruptive_inframe_deletion: One codon is changed and one or more codons are deleted (e.g., a deletion of size multiple of three, not at a codon boundary).

Table 14 presents the annotation type counts of Bacillus tequilensis ANSKLAB04 when aligned with Bacillus mojavensis. Various types of annotation were found. The upstream gene variant had the maximum share of 97.66% (331,498), whereas the downstream gene variant accounted for 7,427 (2.19%).
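Annotation type counts such as those in Tables 11-14 can be tallied directly from a SnpEff-annotated VCF file. The sketch below is an illustration, not the procedure used in the study: it assumes the standard `ANN` INFO field written by SnpEff, in which each annotation is a `|`-separated record whose second field is the annotation type and whose third field is the putative impact. The function names and the toy VCF lines are hypothetical.

```python
from collections import Counter


def count_annotations(vcf_lines):
    """Count SnpEff annotation types and impacts from VCF text lines."""
    effects, impacts = Counter(), Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        for entry in fields[7].split(";"):  # column 8 is INFO
            if not entry.startswith("ANN="):
                continue
            for annotation in entry[len("ANN="):].split(","):
                parts = annotation.split("|")
                if len(parts) < 3:
                    continue
                # Combined effects are joined with '&' in a single record.
                for effect in parts[1].split("&"):
                    effects[effect] += 1
                impacts[parts[2]] += 1
    return effects, impacts


def as_percent(counter):
    """Convert counts to percentages of the total, rounded to 2 decimals."""
    total = sum(counter.values())
    return {k: round(100.0 * v / total, 2) for k, v in counter.items()} if total else {}


# Toy records for illustration only.
toy = [
    "##fileformat=VCFv4.2",
    "chr1\t10\t.\tA\tG\t50\tPASS\tANN=G|upstream_gene_variant|MODIFIER|geneA,G|synonymous_variant|LOW|geneB",
    "chr1\t20\t.\tC\tT\t50\tPASS\tDP=10;ANN=T|upstream_gene_variant|MODIFIER|geneC",
]
effects, impacts = count_annotations(toy)
print(as_percent(effects))
```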
The variant annotation tool SnpEff reports the putative impact of each variant to make it easier and faster to categorize and prioritize variants. However, impact categories must be used with care, as they were created only to help and simplify the filtering process. There is no way to predict whether a HIGH impact or a LOW impact variant is the one producing a phenotype of

Discussion
In the present investigation, we present a high-quality draft genome sequence of Bacillus tequilensis, the first genome sequence determined for a biosurfactant-producing Bacillus tequilensis. Biosurfactant-producing microbes have potential applications in the biotechnology, biodegradation, and pharmaceutical industries. The whole genome sequence of biosurfactant-producing Bacillus tequilensis provides a primary resource for exploring the genes and gene products involved in biosurfactant synthesis. It will also be a key resource for the development of new concepts and techniques in genetic engineering, such as molecular marker-assisted breeding and large-scale production of biosurfactant-producing microbes for bioremediation.

Sample collection and DNA isolation
The strain was isolated from Chilika Lake, the largest brackish water lagoon in India, which spreads over the Puri, Khurda, and Ganjam districts of Odisha state on the east coast of India [12]. Water samples were collected from oil-contaminated sites of the lake (latitude and longitude: 19.8450 N, 85.4788 E). Various organisms were isolated and purified on culture plates and were then enriched in mineral salt medium (MSM), which provides the nutrient conditions for biosurfactant production. The organisms were screened for biosurfactant production by various screening tests, and the emulsification index was calculated. Organisms were identified based on biochemical, macroscopic, and microscopic characteristics. The organism with the best emulsification index was then subjected to optimization of the factors affecting biosurfactant production, with the emulsification index calculated for each factor. In a previous study, this organism was subjected to 16S rRNA sequencing to identify its genus and species [12]. DNA was isolated by the Phenol/Chloroform (PCl) genomic DNA extraction method [12,19] from the bacterial cell pellet obtained after centrifugation. The DNA concentration and purity were checked with a NanoDrop spectrophotometer and a Qubit fluorometer [20].

Library preparation and genome sequencing
Library preparation was performed using the NEXTFlex DNA library protocol outlined in the "NEXTFlex" DNA sample preparation guide (Cat # 5140-02). In brief, genomic DNA was sheared to generate fragments of approximately 300-500 bp in a Covaris microTube with the E220 system (Covaris, Inc., Woburn, MA, USA). The fragment size distribution was checked using the Agilent Bioanalyzer (Agilent Technologies, Santa Clara, CA) with the High Sensitivity DNA Kit (Agilent Technologies) according to the manufacturer's instructions. The resulting fragmented DNA was cleaned up using HighPrep beads (MagBio Genomics, Inc, Gaithersburg, Maryland). These fragments were subjected to end-repair, A-tailing, and ligation of the Illumina multiplexing adaptors using the NEXTFlex DNA Sequencing kit as per the manufacturer's instructions [21]. The resulting ligated DNA was cleaned up using HighPrep beads (MagBio Genomics, Inc, Gaithersburg, Maryland), size selected (400-600 bp) on a 2% low melting agarose gel, and cleaned using a MinElute column (QIAGEN, India). The adapter-ligated fragments were subjected to 10 cycles of PCR (initial denaturation at 98 °C for 2 min; cycling at 98 °C for 30 s, 65 °C for 30 s, and 72 °C for 1 min; and a final extension at 72 °C for 5 min) using the primers provided in the NEXTFlex DNA Sequencing kit (Perkin Elmer). The PCR products were purified using HighPrep beads. The quantity and size distribution of the prepared library were determined using a Qubit fluorometer (Table 16) and the Agilent High Sensitivity DNA Kit (Agilent Technologies), respectively, according to the manufacturer's instructions [Fig 11]. Illumina paired-end sequencing (2 × 150 bp) was performed on the NextSeq 500. The following adapters were used for sequencing (Illumina, Inc) [21].

Whole genome de-novo assembly and analysis
The raw sequence reads were checked for quality using the FASTQC tool [22], through the various modules that the tool provides. Among these, the per base sequence quality and per tile sequence quality modules were examined to validate the quality of the data for further analysis. Low-quality reads were excluded from the analysis using Trimmomatic (v0.36) [23]. The filtered Illumina paired-end reads were assembled de novo using the SPAdes v3.13.0 genome assembler, an open-source tool for de novo assembly [24]. SPAdes performs de novo assembly after error correction of the sequenced reads. The assembled contigs were further scaffolded using the SSPACE program [25]. A genome map was constructed using Circos [26].
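The quality control, trimming, and assembly steps can be chained as a small script. The sketch below only builds the command lines; it is an illustration under stated assumptions, not the authors' exact invocation. The file names, the output layout, the Trimmomatic jar path, and the trimming steps `SLIDINGWINDOW:4:20` and `MINLEN:36` are all assumed, since the text does not report the parameters used. Scaffolding with SSPACE and plotting with Circos are omitted.

```python
def qc_trim_assemble_commands(r1, r2, outdir,
                              trimmomatic_jar="trimmomatic-0.36.jar"):
    """Return argv lists for FastQC, Trimmomatic (PE), and SPAdes.

    All paths and trimming thresholds are illustrative assumptions.
    """
    # Trimmomatic PE writes paired and unpaired outputs for each mate.
    trimmed = [
        f"{outdir}/R1.paired.fq.gz", f"{outdir}/R1.unpaired.fq.gz",
        f"{outdir}/R2.paired.fq.gz", f"{outdir}/R2.unpaired.fq.gz",
    ]
    return [
        # 1. Quality report for the raw reads (outdir must already exist).
        ["fastqc", "-o", outdir, r1, r2],
        # 2. Remove low-quality reads and read ends.
        ["java", "-jar", trimmomatic_jar, "PE", r1, r2, *trimmed,
         "SLIDINGWINDOW:4:20", "MINLEN:36"],
        # 3. De novo assembly of the surviving read pairs.
        ["spades.py", "-1", trimmed[0], "-2", trimmed[2],
         "-o", f"{outdir}/spades"],
    ]


for cmd in qc_trim_assemble_commands("raw_R1.fq.gz", "raw_R2.fq.gz", "out"):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed, so that a failure in one step stops the pipeline before the next one runs.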

Whole genome annotation and GO analysis
The NCBI Prokaryotic Genome Annotation Pipeline (PGAP) version 4.8 was used to annotate the whole genome sequence [27]. Pathway analysis was performed using the KAAS server, with Bacillus subtilis subsp 168 as the reference organism [28]. The functions of the predicted ORFs were categorized by comparison with the COG database [29]. A Venn diagram was constructed using matplotlib-venn in Python [30]. Simple Sequence Repeats (SSR) were identified in each transcript sequence using the MISA Perl script [31].

Variant calling and variant annotation
Variant calling of Bacillus tequilensis was performed by aligning its reads against the top 5 existing homologous reference bacterial genomes: Bacillus tequilensis (KCTC 13622), Bacillus halotolerans, Bacillus subtilis, Bacillus mojavensis, and Bacillus vallismortis. BWA (Burrows-Wheeler Aligner)-MEM was used for the alignment [32]. Duplicated reads can bias the mapping results and produce erroneous variant calls. To prevent such errors, the Sambamba tool was used to remove duplicate reads [32]. Duplicate reads are identified using mapping information such as the start position and the CIGAR string [33]. SAMTools was used to manipulate the SAM/BAM files produced by mapping [34]. In resequencing analysis, it is used in particular to obtain variant information by calculating genotype likelihoods at every position within the sample under analysis. Variant annotation was performed using SnpEff (v4.3t) [35]. SnpEff annotates the possible effects on genes of the variants identified through mapping. The present study used SnpEff to report the genes and transcripts affected by each variant, the location of the variants, and how each variant affects protein synthesis (e.g., by generating a stop codon).
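The idea of identifying duplicate reads from mapping information can be shown with a simplified sketch. This is not Sambamba's actual algorithm: it assumes single-end reads represented as small records, treats reads with the same reference, start position, strand, and CIGAR string as duplicates, and keeps the one with the highest mapping quality. The `Read` type and the function name are hypothetical.

```python
from collections import namedtuple

# Minimal stand-in for an aligned read (hypothetical structure).
Read = namedtuple("Read", "name ref start strand cigar mapq")


def remove_duplicates(reads):
    """Keep one read per (ref, start, strand, cigar) group.

    Within a group the read with the highest mapping quality is kept;
    on ties the first read encountered wins. Input order is preserved.
    """
    best = {}
    for read in reads:
        key = (read.ref, read.start, read.strand, read.cigar)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    return [r for r in reads
            if best[(r.ref, r.start, r.strand, r.cigar)] is r]


reads = [
    Read("r1", "scaffold1", 100, "+", "150M", 60),
    Read("r2", "scaffold1", 100, "+", "150M", 30),     # duplicate of r1
    Read("r3", "scaffold1", 100, "-", "150M", 60),     # other strand, kept
    Read("r4", "scaffold1", 250, "+", "100M50S", 60),  # different position
]
print([r.name for r in remove_duplicates(reads)])
```

Production tools additionally account for paired-end mates and soft-clipped start positions, which this sketch deliberately leaves out.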
Supporting information S1